ABSTRACT

Summary: End-to-end next-generation sequencing microbiology data
analysis requires a diversity of tools covering bacterial resequencing, 
de novo assembly, scaffolding, bacterial RNA-Seq, gene annotation 
and metagenomics. However, the construction of computational pipe- 
lines that use different software packages is difficult owing to a lack of 
interoperability, reproducibility and transparency. To overcome these 
limitations we present Orione, a Galaxy-based framework consisting 
of publicly available research software and specifically designed 
pipelines to build complex, reproducible workflows for next-generation 
sequencing microbiology data analysis. Enabling microbiology 
researchers to conduct their own custom analysis and data manipu- 
lation without software installation or programming, Orione provides 
new opportunities for data-intensive computational analyses in micro- 
biology and metagenomics. 

Availability and implementation: Orione is available online at
http://orione.crs4.it.

Contact: gianmauro.cuccuru@crs4.it 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 

1 INTRODUCTION

Application of next-generation sequencing (NGS) in microbiol- 
ogy is becoming a common practice with a profound impact on 
research, diagnostic and clinical microbiology (Loman et al., 
2012). Recent applications include genomic sequencing, differ- 
ential transcription analysis, variant investigation, as well as 
metagenomics studies. Major challenges include finishing draft
assemblies, followed by reliable genome annotation, and robust
dissection of microbial communities, including those associated
with human health and disease. Furthermore, there is an increas- 
ing need to process and present data in a fashion that is transpar- 
ent and reproducible and to provide analysis frameworks that 
are usable and cost-effective for biomedical researchers. 

To address these challenges, we developed Orione, an online 
framework for integrative analysis of NGS microbiology data. 
Orione is based on Galaxy (Goecks et al., 2010), an open plat- 
form for reproducible data-intensive computational analysis used 
in many diverse biomedical research environments. Orione is the
first freely available platform that supports the whole life cycle of 
microbiology research data from production and annotation to 
publication and sharing. Commercial alternatives exist
(e.g. CLC Genomics Workbench by CLC Bio), but Orione is 
unique in transparently combining the most used open source 
bioinformatics tools for microbiology. Orione is currently 
applied to a variety of microbiological projects including bacterial
resequencing, de novo assembly and microbiome investigations;
see http://goo.gl/DwbgPD for a list. Furthermore, Orione is part 
of an ongoing project to integrate Galaxy with Hadoop-based 
tools to provide scalable computing (Leo et al., 2012); a specialized
version of OMERO (Allan et al., 2012) to model biomedical
data and the chain of actions that connect them; and iRODS
(Rajasekar et al., 2010) to efficiently support inter-institutional
data sharing. This infrastructure is already used in production at
the Center for Advanced Studies, Research and Development in
Sardinia for the automated processing of sequencing data
(Pireddu et al., 2013) and for quality control in gene therapy
applications (Biffi et al., 2013).



2 FEATURES AND METHODS 

Orione consists of 'best-of-breed' NGS bioinformatics tools cov- 
ering end-to-end data analysis for bacterial resequencing, de novo 
assembly, scaffolding, bacterial RNA-Seq, gene annotation, 
metagenomics and metatranscriptomics. Publicly available re- 
search tools were integrated under the open source Galaxy 
framework with pipelines and workflows newly developed by 
our group for ready-to-go microbiological analysis. Although 
several of the tools for NGS microbiology data analysis were 
already available in Galaxy, a significant effort was required to 
expand the Galaxy functionalities with new tools such as
SSPACE (Boetzer et al., 2011), SSAKE (Warren et al., 2007),
SOPRA (Dayarian et al., 2010), SEQuel (Ronen et al., 2012),
EDGE-pro (Magoc et al., 2013), Glimmer (Gene Locator and
Interpolated Markov ModelER; Delcher et al., 2007) and Prokka
(http://goo.gl/aSuHb). We refer to the Supplementary information for a
description of the complete set of Orione tools and workflows. 



3 FUNCTIONALITIES 

Orione complements the flexible Galaxy workflow environment, 
allowing microbiologists without any specific hardware or in- 
formatics skill to consistently access a set of NGS data analysis 
tools and conduct reproducible data-intensive computational 
analyses from quality control to microbial gene annotation. In 
the following paragraphs, we describe the main Orione 
functionalities. 

Preprocessing, quality control and trimming. The fundamental
step before any NGS analysis is the quality control of
reads and their trimming. To cope with long reads and paired- 
end technology, FastX (http://goo.gl/GxqyV) and FASTQC 
(http://goo.gl/6TUqD) were complemented with specifically de- 
veloped tools (see also workflow #1 in the Supplementary 
information). 
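
To make the trimming step concrete, the following Python sketch removes low-quality bases from the 3' end of a single read and reports its mean quality. It is an illustration only and not the implementation used by FastX, FASTQC or the Orione tools; the function names, the Phred+33 encoding and the quality threshold of 20 are assumptions chosen for the example.

```python
def phred33_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop trailing bases whose quality is below min_quality.

    Hypothetical helper: min_quality=20 is an assumed default,
    not a value taken from the Orione workflows.
    """
    scores = phred33_scores(quality_string)
    end = len(scores)
    # Walk back from the 3' end while bases are below the threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def mean_quality(quality_string):
    """Average Phred score of a read (0.0 for an empty read)."""
    scores = phred33_scores(quality_string)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `trim_3prime("ACGTACGT", "IIIIII##")` keeps the first six bases, because `I` decodes to a score of 40 and `#` to a score of 2.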

Read mapping. Mapping is a key step in many NGS applications,
from bacterial resequencing to variant calling. The most
widely used aligners are integrated in Orione, including BWA
(Li and Durbin, 2009), Bowtie1 (Langmead et al., 2009),
Bowtie2 (Langmead and Salzberg, 2012), SOAP (Li et al.,
2008) and MOSAIK (http://git.io/QrYWXg). We further
added BLAT (Kent, 2002), SHRiMP (David et al., 2011),
LASTZ (Harris, 2007) and BFAST (Homer et al., 2009) for
use with long reads from the Roche 454 platform.

De novo assembly. De novo assembly produces contigs without 
the aid of a reference genome. Different methods, either based on 
a de Bruijn graph [Velvet (Zerbino and Birney, 2008), ABySS 
(Simpson et al., 2009) and SPAdes (Bankevich et al., 2012)] or on
a greedy approach [SSAKE, Edena (Hernandez et al., 2008)], are
available in Orione. 
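
The de Bruijn graph idea can be illustrated with a short Python sketch: every k-mer in the reads becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by following unambiguous edges. This is a toy example under strong simplifying assumptions (error-free reads, a single strand, no check of incoming edges) and does not reflect how Velvet, ABySS or SPAdes are implemented; the function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes
    of the distinct k-mers observed in the reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Follow edges from start while each node has exactly one successor.

    Simplification: stops at branches, dead ends and revisited nodes;
    real assemblers also handle errors, repeats and reverse complements.
    """
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in visited:
            break  # avoid looping forever on a cycle
        contig += successor[-1]
        visited.add(successor)
        node = successor
    return contig
```

With the two overlapping reads `ACGTC` and `GTCAG` and `k = 3`, `extend_contig(build_de_bruijn_graph(["ACGTC", "GTCAG"], 3), "AC")` reconstructs `ACGTCAG`.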

Scaffolding. After mapping, contigs are ordered and oriented 
to produce even longer sequences called scaffolds, exploiting the 
mate-pair/paired-end information. Orione includes the most es- 
tablished scaffolders such as SSAKE, SSPACE, SEQuel and 
SOPRA. 

Post assembly, contig statistics, (multi) aligning and variant
calling. This section of Orione includes tools we have developed
covering tasks such as genome-scale alignment, extraction of
high-quality contigs and statistics over contigs or draft genomes
(N50/NG50 values, contig length distribution, high/low quality
regions/gaps in draft genomes).
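
As a reminder of how the N50 and NG50 values are defined, the sketch below computes both from a list of contig lengths. N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly; NG50 uses half of an estimated genome size instead. The code is a minimal illustration with hypothetical function names, not the Orione tool itself.

```python
def nx_value(lengths, reference_total, fraction=0.5):
    """Length of the contig at which the cumulative sum of contig
    lengths (sorted longest first) reaches fraction * reference_total.

    Returns None when the contigs never reach that threshold.
    """
    threshold = reference_total * fraction
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= threshold:
            return length
    return None


def n50(lengths):
    """N50: the reference is the total assembly length."""
    return nx_value(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: the reference is an (estimated) genome size."""
    return nx_value(lengths, genome_size)
```

For contig lengths 80, 70, 50, 40, 30 and 20 (total 290), the N50 is 70, whereas the NG50 for an assumed genome size of 400 is 50.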

Annotation. Annotation is the process of identifying meaning- 
ful biological information from sequences. Glimmer and 
tRNAscan-SE (Lowe and Eddy, 1997) were wrapped into 
Orione together with the Prokka pipeline, enabling easy 
GenBank/DDBJ/ENA submission.

RNA-Seq. We integrated the EDGE-pro tool for bacterial RNA-Seq
analysis. As EDGE-pro requires genome annotation files, we
developed an accessory tool ('Get EDGE-pro files') that retrieves 
them directly from the NCBI RefSeq repository. 

Metagenomics and other tools. We added MetaPhlAn (Segata
et al., 2012) and MetaVelvet (Namiki et al., 2012) to the standard
Galaxy metagenomics pipeline. The MetaGeneMark (Zhu
et al., 2010) annotation tool has been added for gene prediction
in metagenomic sequences, and a workflow has been developed
for (bacterial) metatranscriptome analysis. We complete this
section with tools for data filtering, conversion and display of
taxonomic abundance in the Krona visualizer (Ondov
et al., 2011).